#pandas reindex
shilkaren · 5 years ago
Photo
pandas reindex
Insideaiml is one of the best platforms where you can learn Python, Data Science, Machine Learning, Artificial Intelligence & showcase your knowledge to the outside world.
0 notes
t-baba · 8 years ago
Photo
Pandas: The Swiss Army Knife for Your Data, Part 2
This is part two of a two-part tutorial about Pandas, the amazing Python data analytics toolkit. 
In part one, we covered the basic data types of Pandas: the series and the data frame. We imported and exported data, selected subsets of data, worked with metadata, and sorted the data. 
In this part, we'll continue our journey and deal with missing data, data manipulation, data merging, data grouping, time series, and plotting.
Dealing With Missing Values
One of the strongest points of pandas is its handling of missing values. It will not just crash and burn in the presence of missing data. When data is missing, pandas replaces it with numpy's np.nan (not a number), and it doesn't participate in any computation.
Let's reindex our data frame, adding more rows and columns, but without any new data. To make it interesting, we'll populate some values.
>>> index = ['one', 'two', 'three', 'four', 'five']  # the index used in part one
>>> df = pd.DataFrame(np.random.randn(5,2), index=index, columns=['a','b'])
>>> new_index = df.index.append(pd.Index(['six']))
>>> new_columns = list(df.columns) + ['c']
>>> df = df.reindex(index=new_index, columns=new_columns)
>>> df.loc['three', 'c'] = 3
>>> df.loc['four', 'c'] = 4
>>> df
              a         b    c
one   -0.042172  0.374922  NaN
two   -0.689523  1.411403  NaN
three  0.332707  0.307561  3.0
four   0.426519 -0.425181  4.0
five  -0.161095 -0.849932  NaN
six         NaN       NaN  NaN
Note that df.index.append() returns a new index and doesn't modify the existing index. Also, df.reindex() returns a new data frame that I assign back to the df variable.
At this point, our data frame has six rows. The last row is all NaNs, and every other row except the third and the fourth has NaN in the "c" column. What can you do with missing data? Here are the options:
Keep it (but it will not participate in computations).
Drop it (the result of the computation will not contain the missing data).
Replace it with a default value.
Keep the missing data
---------------------
>>> df *= 2
>>> df
              a         b    c
one   -0.084345  0.749845  NaN
two   -1.379046  2.822806  NaN
three  0.665414  0.615123  6.0
four   0.853037 -0.850362  8.0
five  -0.322190 -1.699864  NaN
six         NaN       NaN  NaN

Drop rows with missing data
---------------------------
>>> df.dropna()
              a         b    c
three  0.665414  0.615123  6.0
four   0.853037 -0.850362  8.0

Replace with default value
--------------------------
>>> df.fillna(5)
              a         b    c
one   -0.084345  0.749845  5.0
two   -1.379046  2.822806  5.0
three  0.665414  0.615123  6.0
four   0.853037 -0.850362  8.0
five  -0.322190 -1.699864  5.0
six    5.000000  5.000000  5.0
If you just want to check if you have missing data in your data frame, use the isnull() method. This returns a boolean mask of your dataframe, which is True for missing values and False elsewhere.
>>> df.isnull()
           a      b      c
one    False  False   True
two    False  False   True
three  False  False  False
four   False  False  False
five   False  False   True
six     True   True   True
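If you only need per-column counts of missing values rather than a full mask, you can chain an aggregation onto the mask. A minimal sketch, continuing with the same df (True counts as 1 when summed):

>>> df.isnull().sum()
a    1
b    1
c    4
dtype: int64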
Manipulating Your Data
When you have a data frame, you often need to perform operations on the data. Let's start with a new data frame that has four rows and three columns of random integers between 1 and 9 (inclusive).
>>> df = pd.DataFrame(np.random.randint(1, 10, size=(4, 3)), columns=['a','b', 'c'])
>>> df
   a  b  c
0  1  3  3
1  8  9  2
2  8  1  5
3  4  6  1
Now, you can start working on the data. Let's sum up all the columns and assign the result to the last row, and then sum all the rows (dimension 1) and assign to the last column:
>>> df.loc[3] = df.sum()
>>> df
    a   b   c
0   1   3   3
1   8   9   2
2   8   1   5
3  21  19  11
>>> df['c'] = df.sum(axis=1)
>>> df
    a   b   c
0   1   3   7
1   8   9  19
2   8   1  14
3  21  19  51
You can also perform operations on the entire data frame. Here is an example of subtracting 3 from each and every cell:
>>> df -= 3
>>> df
    a   b   c
0  -2   0   4
1   5   6  16
2   5  -2  11
3  18  16  48
For total control, you can apply arbitrary functions:
>>> df.apply(lambda x: x ** 2 + 5 * x - 4)
     a    b     c
0  -10   -4    32
1   46   62   332
2   46  -10   172
3  410  332  2540
Merging Data
Another common scenario when working with data frames is combining and merging data frames (and series) together. Pandas, as usual, gives you different options. Let's create another data frame and explore the various options.
>>> df2 = df // 3
>>> df2
   a  b   c
0 -1  0   1
1  1  2   5
2  1 -1   3
3  6  5  16
Concat
When using pd.concat, pandas simply concatenates all the rows of the provided parts in order. There is no alignment of indexes. See in the following example how duplicate index values are created:
>>> pd.concat([df, df2])
    a   b   c
0  -2   0   4
1   5   6  16
2   5  -2  11
3  18  16  48
0  -1   0   1
1   1   2   5
2   1  -1   3
3   6   5  16
You can also concatenate columns by using the axis=1 argument:
>>> pd.concat([df[:2], df2], axis=1)
     a    b     c  a  b   c
0 -2.0  0.0   4.0 -1  0   1
1  5.0  6.0  16.0  1  2   5
2  NaN  NaN   NaN  1 -1   3
3  NaN  NaN   NaN  6  5  16
Note that because the first data frame (I used only two rows) didn't have as many rows, the missing values were automatically populated with NaNs, which changed those column types from int to float.
It's possible to concatenate any number of data frames in one call.
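For example, here is a minimal sketch (reusing df and df2 from above) that concatenates three parts and labels each with the optional keys parameter, which builds an outer index level; the part names are arbitrary:

>>> parts = pd.concat([df, df2, df], keys=['first', 'second', 'third'])
>>> parts.loc['second']
   a  b   c
0 -1  0   1
1  1  2   5
2  1 -1   3
3  6  5  16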
Merge
The merge function behaves similarly to a SQL join. It merges all the columns from rows that have matching keys. Note that it operates on two data frames only:
>>> df = pd.DataFrame(dict(key=['start', 'finish'], x=[4, 8]))
>>> df
      key  x
0   start  4
1  finish  8
>>> df2 = pd.DataFrame(dict(key=['start', 'finish'], y=[2, 18]))
>>> df2
      key   y
0   start   2
1  finish  18
>>> pd.merge(df, df2, on='key')
      key  x   y
0   start  4   2
1  finish  8  18
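By default, merge performs an inner join; the optional how parameter selects other SQL-style joins. A minimal sketch with a third, made-up frame whose 'middle' key has no match:

>>> df3 = pd.DataFrame(dict(key=['start', 'middle'], z=[1, 2]))
>>> pd.merge(df, df3, on='key', how='outer')
      key    x    z
0   start  4.0  1.0
1  finish  8.0  NaN
2  middle  NaN  2.0

Note how the unmatched cells become NaN, which also forces the integer columns to float.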
Append
The data frame's append() method is a little shortcut. It functionally behaves like concat(), but saves some keystrokes. (Note that append() was deprecated in pandas 1.4 and removed in 2.0, so concat() is the forward-compatible option.)
>>> df
      key  x
0   start  4
1  finish  8

Appending one row using the append() method
--------------------------------------------
>>> df.append(dict(key='middle', x=9), ignore_index=True)
      key  x
0   start  4
1  finish  8
2  middle  9

Appending one row using concat()
--------------------------------
>>> pd.concat([df, pd.DataFrame(dict(key='middle', x=[9]))], ignore_index=True)
      key  x
0   start  4
1  finish  8
2  middle  9
Grouping Your Data
Here is a data frame that contains the members and ages of two families: the Smiths and the Joneses. You can use the groupby() method to group data by last name and find information at the family level like the sum of ages and the mean age:
>>> df = pd.DataFrame(
...     dict(first='John Jim Jenny Jill Jack'.split(),
...          last='Smith Jones Jones Smith Smith'.split(),
...          age=[11, 13, 22, 44, 65]))
>>> df.groupby('last').sum()
       age
last
Jones   35
Smith  120
>>> df.groupby('last').mean()
        age
last
Jones  17.5
Smith  40.0
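To compute several aggregates in one pass, you can hand groupby an agg() call with a list of function names. A minimal sketch using the same family data frame:

>>> df.groupby('last')['age'].agg(['sum', 'mean', 'count'])
       sum  mean  count
last
Jones   35  17.5      2
Smith  120  40.0      3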
Time Series
A lot of important data is time series data. Pandas has strong support for time series data starting with data ranges, going through localization and time conversion, and all the way to sophisticated frequency-based resampling.
The date_range() function can generate sequences of datetimes. Here is an example of generating a six-week period starting on 1 January 2017 using the UTC time zone.
>>> weeks = pd.date_range(start='1/1/2017', periods=6, freq='W', tz='UTC')
>>> weeks
DatetimeIndex(['2017-01-01', '2017-01-08', '2017-01-15', '2017-01-22',
               '2017-01-29', '2017-02-05'],
              dtype='datetime64[ns, UTC]', freq='W-SUN')
Adding a timestamp to your data frames, either as a data column or as the index, is great for organizing and grouping your data by time. It also allows resampling. Here is an example of resampling one-minute data into five-minute aggregations.
>>> minutes = pd.date_range(start='1/1/2017', periods=10, freq='1Min', tz='UTC')
>>> ts = pd.Series(np.random.randn(len(minutes)), minutes)
>>> ts
2017-01-01 00:00:00+00:00    1.866913
2017-01-01 00:01:00+00:00    2.157201
2017-01-01 00:02:00+00:00   -0.439932
2017-01-01 00:03:00+00:00    0.777944
2017-01-01 00:04:00+00:00    0.755624
2017-01-01 00:05:00+00:00   -2.150276
2017-01-01 00:06:00+00:00    3.352880
2017-01-01 00:07:00+00:00   -1.657432
2017-01-01 00:08:00+00:00   -0.144666
2017-01-01 00:09:00+00:00   -0.667059
Freq: T, dtype: float64
>>> ts.resample('5Min').mean()
2017-01-01 00:00:00+00:00    1.023550
2017-01-01 00:05:00+00:00   -0.253311
Freq: 5T, dtype: float64
Plotting
Pandas supports plotting with matplotlib. Make sure it's installed: pip install matplotlib. To generate a plot, you can call the plot() method of a series or a data frame. There are many options to control the plot, but the defaults work for simple visualization purposes. Here is how to generate a line graph and save it to a PDF file.
ts = pd.Series(np.random.randn(1000), index=pd.date_range('1/1/2017', periods=1000))
ts = ts.cumsum()
ax = ts.plot()
fig = ax.get_figure()
fig.savefig('plot.pdf')
Note that on macOS, Python must be installed as a framework for plotting with Pandas.
Conclusion
Pandas is a very broad data analytics framework. It has a simple object model with the concepts of series and data frame and a wealth of built-in functionality. You can compose and mix pandas functions and your own algorithms. 
Data importing and exporting in pandas are very extensive too, ensuring that you can integrate it easily into existing systems. If you're doing any data processing in Python, pandas belongs in your toolbox.
by Gigi Sayfan via Envato Tuts+ Code http://ift.tt/2gaPZ24
2 notes
swelldomains · 8 years ago
Text
How Not to Get Penalized by the Penguin 4.0 Real-Time Algorithm Filter
The long and eerie silence that has hung in the air since Penguin's last update has finally been broken. On 23rd September, Google dropped the bombshell and announced that Penguin is now part of its core search ranking algorithm.
And it has been well worth the 715-day wait! The two biggest changes in how the filter (or ranking signal, depending on how you look at it) now works are:
Penguin 4 is real time. Once you remove your bad links and disavow those you can't, there's little more you can do beyond sitting tight and praying to the other G (God).
Penguin 4 is granular. Your entire website won't be knocked out of the park just because a post on sooperarticles.com links to your out-of-stock product.
A week after the announcement, the jury is still out on whether Penguin 4.0 has fully rolled out. And given the sheer number of signals, filters and data refreshes that overlap, it is extremely hard to determine the effects of Penguin (or any other algorithmic factor, for that matter) on a particular URL or SERP with absolute certainty.
Dr. Pete Meyers of Moz tweeted something very significant:
"My gut feeling is that we're not going to see a big Penguin 4.0 spike."
That has held true for a week so far. A poll on Search Engine Roundtable backs it up:
It appears the first three iterations have succeeded in scaring the spammers away, and SEOs and webmasters across the globe have turned over a new leaf. Or it may be that Google has only just integrated the code into the algorithm, and with a few tweaks the effect may become more visible as Penguin hits all of Google's data centers and rolls out over an extended period of time, as gazillions of URLs are re-crawled.
Eric Enge of Stone Temple Consulting confirmed this:
So if you, like us, are in the 72% that aren't seeing any fireworks, read on to find out how you can stay safe and away from the warzone.
Don't expect to be warned
As with Panda, there won't be any official word from Google on future updates. The obvious way to know if your site had been hit by an algorithmic penalty used to be to check whether your traffic dropped on the date of the announcement.
Not happening anymore.
And there's no use thronging those forums or pestering @rustybrick for the latest news. You alone are responsible for your sites, your SEO-related activities, your ups and downs, and your recovery.
Build great links
“Create great content!”
You hear that everywhere. It's what Google has always told us. It's what the doctor said when my child was born.
Okay, I exaggerate. But not by a lot. While most of us are busy creating great content these days, we tend to give just a wee bit less importance to that old workhorse - PageRank. And the road to high PageRank is paved with authority links. Google gently reminded us of that elephant in the room:
The integration of Penguin into the core algorithm does not make link building obsolete. In fact, it is more critical than ever. Link building techniques and strategies that are viable and actually work have remained practically the same over the past half-decade or so. Going forward, the real-time and granular qualities of Penguin 4.0 will ensure that link building follows suit.
How so?
First, let's consider the "real-time" nature of Penguin. SEOs everywhere are delighted that their recovery efforts (disavowing or removing low-quality links) will yield prompt results. However, the opposite is also true. One indiscreet campaign, a couple of gaudy sources, or an ill-considered spike in link velocity can cause your rankings to plummet at the wrong time and cost you a great deal of money.
That said, it will be much easier to identify what triggered the fall, and if you act swiftly, you'll be back in the game in no time. SEO experts, including (I would like to think) us, have always maintained that a "penalty recovery" doesn't mean you get your rankings back; it just means there's no spam holding you down any more. To really regain your visibility and impression volume, you have to continually build authority links on the topic (and to the page) in question.
In the past, that used to be more of a spray-and-pray approach - you went on building new, authoritative links, but remained in a state of extended suspense until the next data refresh happened. Our clients and site owners in general took this advice (which cost money, effort and time to execute) with a pinch of salt and a barely discernible roll of the eyes. We now stand absolved: compared to the old days, Penguin 4 promises an essentially instantaneous recovery (once Google recrawls and reindexes your pages), so surges in traffic can be clearly attributed to positive efforts, if any.
Now let's take a closer look at the "granularity" aspect. As Google said in their announcement (and Gary Illyes later clarified), Penguin will no longer wholesale demote the rankings of entire sites in search results. Instead, Google will now look at "spam signals" and devalue the actual incoming links based on the crappiness of the individual pages or domains that are linking out to your site.
I believe this is a sign that Google is reasonably confident they have achieved two things:
They have built a comprehensive working database of spammy domains with the help of all the disavow files submitted so far.
They are now able to apply something like Moz's Spam Score on a per-URL basis.
Taken together with the real-time component, this means you can plan for not only the quantity but also the quality of the links you build. If you're a gadget seller, for instance, by all means ramp up your link building with reviews, comparisons and whatnot (from wherever you can get them) in the run-up to Cyber Monday. But get links from more how-to and benefit-focused content (on consumer tech sites such as Engadget or Wired) the rest of the year round.
Got my point?
Mind your keywords
One mistake even experienced SEOs and link builders make is to equate Penguin with bad links. Penguin fights much more than spammy links. In its original blog post announcing the arrival of Penguin, Google specifically mentioned (and gave an example of) keyword stuffing ahead of link schemes. They went on to say:
What are these quality guidelines? Here you go:
So, regardless of whether the content you produce is on your own site or on the one from which you're building a link, make sure it's unique, useful, high quality and provides a good user experience. The same goes for your tags, markup and meta content. We do remember to create content for humans now, but we frequently forget to write titles, descriptions and anchor text for them. Experimenting with your rich snippets, over-optimizing your landing pages, getting creative with your affiliate programs, or letting your audience run wild on your site could all lead to unintended consequences.
I'm privy to first-hand empirical (though not irrefutable) evidence that a URL can be hit by Penguin for certain search terms but not others, so I suggest you go as broad as possible with your keyword targeting and content optimization.
Don't try to outwit Google
As Google continues to delegate progressively bigger pieces of its prized algorithm to machine learning, it faces a dilemma that has been robbing programmers of their sleep for decades: new code inevitably breaks the code that currently works. With AI, this problem is compounded because, as numerous Googlers have admitted, even those who designed and set these algorithms in motion don't fully understand how they work at any given moment.
Gary revealed that many Google patents aren't being put to good use, as they weren't yet feasible or compatible with existing systems.
We've seen exact match domain names and exact match anchor text repeatedly cycle through now-it-works, now-it-doesn't phases.
In the past, there have been reports of Google algorithm updates such as Pigeon being rolled back. Penguin 3 demotions will be removed in order to let Penguin 4 work its magic.
Bearing the above in mind, if you are an advanced link builder who doesn't mind trying on hats of varying color, the real-time and granular properties of Penguin 4 might tempt you to experiment with things along the following lines:
Remember when you put those "borderline" domains (the ones you weren't quite sure were hurting you) in your disavow file "just to be safe"? Now is the time to remove them and find out. While Gary Illyes and John Mueller both tweeted that Google's disavow recommendations haven't changed, Gary later admitted that "for Penguin specifically there's less need" for a disavow file (screenshot below).
Burning the candle at the other end, you could also try building links that aren't exactly natural or editorial but that you think "wouldn't do any harm." Obviously I don't mean large-scale guest blogging, redirected domains or repurposed microsites, ecommerce site cloning, or dynamic PBNs using large-scale cloaking combined with mass-produced, fresh content.
Build links to specific pages or campaign-specific sections of your site. Combine this with an internal linking strategy that involves linking to these designated URLs from the homepage or other important pages, and tweaking the meta robots nofollow tag to control the flow of link juice (hello PageRank/anchor text sculpting) to them.
If any of these methods get you in trouble with Penguin, you could quickly stop and turn back.
But... don't try these at home. The risk is absolutely not worth it. If you can think of it, Google has already thought of it. Someone has already done it, and the machines are already dealing with it.
Manual actions seem to have declined on the surface, but they're very much alive and kicking; Google sent out 4,300,000 notices of manual actions to webmasters in 2015. And they reserve an especially nasty revenge for systematic spammers:
You're better off following their guidelines to the letter, even if that white hat doesn't get you that elusive front-row seat.
Coda
While Gary is confident that many webmasters will be happy once Penguin 4 completes its rollout, Dr. Pete's prophecy makes more sense to me:
"If you didn't see a Penguin recovery, I doubt you'll see it in the next few days."
Karma is a bitch, but she doesn't bite everybody. Google continues to shuffle and fuddle its SERPs in order to make "optimization" extremely difficult, abstruse and unpredictable.
The only thing worth optimizing in the future is what people type into the search box - aim to build a brand and influence the conversation around it to such a degree that Google is left with no choice but to chase it.
If you have any questions about Google penalties, or need any help recovering from manual or algorithmic actions, don't hesitate to get in touch. We have a team of SEO pros who understand the principles of Google's search algorithm, and who have the experience and tools needed to keep your traffic and visibility climbing.
1 note
codingwiz-blog1 · 5 years ago
Text
Assignment-3
import pandas as pd
import numpy as np
In [2]:
data = pd.read_csv("nesarc_pds.csv", low_memory=False)
In [3]:
data.columns=map(str.upper,data.columns)
In [4]:
data.head()
Out[4]: a preview of the first five rows of the full data frame (5 rows × 3010 columns, running from UNNAMED: 0, ETHRACE2A, ETOTLCA2, IDNUM, PSU, STRATUM, WEIGHT, CDAY, CMON, CYEAR, … through the *12ABDEP flags and NDSYMPTOMS); the table is too wide to reproduce here.
Variables I'll be taking from this codebook:
CONSUMER = Drinking Status
S1Q2C2 = Raised by relatives before 18 age
SMOKER = Tobacco use status.
S3AQ52 = Age started smoking cigars every day.
S2AQ19 = Age at start of period of heaviest drinking.
NOTE:
Since I've not used the Spyder IDE, the code syntax differs slightly from the video lectures. I hope you'll be able to follow each piece of code; I've added comments for your reference wherever needed.
In [5]:
sub= data[['CONSUMER','S1Q2C2', 'SMOKER', 'S3AQ52', 'S2AQ19']]
In [6]:
sub1 = sub.copy()
In [7]:
sub1.head()
Out[7]:
   CONSUMER S1Q2C2  SMOKER S3AQ52 S2AQ19
0         3             3
1         1             3            21
2         3             3
3         2             3            16
4         2             3            18
CONSUMER:
1. Current drinker
2. Ex-drinker
3. Lifetime abstainer
Since this coding is counterintuitive, we can change it to:
0. Lifetime abstainer (one who never drank)
1. Ex-drinker
2. Current drinker
In [8]:
print("Before labels (CONSUMER) : ") print(sorted(sub1['CONSUMER'].unique()))
Before labels (CONSUMER) : [1, 2, 3]
In [9]:
def recode1(val):
    if val==1:
        return 2
    if val==2:
        return 1
    if val==3:
        return 0
In [10]:
sub1['CONSUMER_NEWL'] = sub1['CONSUMER'].apply(lambda x : recode1(x))
In [11]:
print("After labels (CONSUMER_NEWL) : ") print(sorted(sub1['CONSUMER_NEWL'].unique()))
After labels (CONSUMER_NEWL) : [0, 1, 2]
In [12]:
sub1.head()
Out[12]:
   CONSUMER S1Q2C2  SMOKER S3AQ52 S2AQ19  CONSUMER_NEWL
0         3             3                             0
1         1             3            21               2
2         3             3                             0
3         2             3            16               1
4         2             3            18               1
SMOKER:
1. Current user
2. Ex-user
3. Lifetime nonsmoker
Since this is also counterintuitive, we can change it to:
0. Lifetime nonsmoker
1. Ex-user
2. Current user
In [13]:
print("Before labels (SMOKER) : ") print(sorted(sub1['SMOKER'].unique()))
Before labels (SMOKER) : [1, 2, 3]
In [14]:
# using the above 'recode1' function here too
sub1['SMOKER_NEWL'] = sub1['SMOKER'].apply(lambda x : recode1(x))
In [15]:
sub1.head()
Out[15]:
   CONSUMER S1Q2C2  SMOKER S3AQ52 S2AQ19  CONSUMER_NEWL  SMOKER_NEWL
0         3             3                             0            0
1         1             3            21               2            0
2         3             3                             0            0
3         2             3            16               1            0
4         2             3            18               1            0
In [16]:
print("After labels (SMOKER_NEWL) : ") print(sorted(sub1['SMOKER_NEWL'].unique()))
After labels (SMOKER_NEWL) : [0, 1, 2]
In [17]:
columnsTitles = ['CONSUMER','SMOKER', 'S1Q2C2', 'S3AQ52', 'S2AQ19', 'CONSUMER_NEWL','SMOKER_NEWL']
sub1 = sub1.reindex(columns=columnsTitles)
In [18]:
sub1.head()
Out[18]:
   CONSUMER  SMOKER S1Q2C2 S3AQ52 S2AQ19  CONSUMER_NEWL  SMOKER_NEWL
0         3       3                                   0            0
1         1       3               21                  2            0
2         3       3                                   0            0
3         2       3               16                  1            0
4         2       3               18                  1            0
In [19]:
sub1['CONSUMER_NEWL'].value_counts(sort=False)
Out[19]:
0     8266
1     7881
2    26946
Name: CONSUMER_NEWL, dtype: int64
In [20]:
sub1['SMOKER_NEWL'].value_counts(sort=False)
Out[20]:
0    23901
1     8074
2    11118
Name: SMOKER_NEWL, dtype: int64
Managing variable - S3AQ52 (AGE STARTED SMOKING CIGARS EVERY DAY)
In [21]:
sub1['S3AQ52'].unique()
Out[21]:
array([' ', '21', '16', '20', '30', '40', '17', '25', '15', '35', '38',
       '37', '26', '53', '24', '54', '18', '28', '55', '45', '32', '22',
       '48', '39', '50', '34', '99', '36', '12', '60', '42', '51', '23',
       '64', '47', '29', '19', '9', '70', '41', '52', '33', '46', '31',
       '59', '8', '10', '44', '43', '65', '57', '69', '58', '27', '66',
       '14', '84', '5', '11', '13', '49', '62', '63', '80', '56'],
      dtype=object)
In [22]:
sub1[sub1['S3AQ52']==" "]
Out[22]:
       CONSUMER  SMOKER S1Q2C2 S3AQ52 S2AQ19  CONSUMER_NEWL  SMOKER_NEWL
0             3       3                                   0            0
1             1       3               21                  2            0
2             3       3                                   0            0
3             2       3               16                  1            0
4             2       3               18                  1            0
...         ...     ...    ...    ...    ...            ...          ...
43088         3       3                                   0            0
43089         1       3               18                  2            0
43090         1       1               17                  2            2
43091         1       1      2        24                  2            2
43092         2       3               17                  1            0

42374 rows × 7 columns
In [23]:
# Converting blank values (people who never smoked) to 0
sub1.loc[sub1['S3AQ52']==" ", 'S3AQ52'] = 0
In [24]:
# Converting string values of the data frame to numeric
sub1['S3AQ52'] = pd.to_numeric(sub1['S3AQ52'])
In [25]:
sub1['S3AQ52'].unique()
Out[25]:
array([ 0, 21, 16, 20, 30, 40, 17, 25, 15, 35, 38, 37, 26, 53, 24, 54, 18,
       28, 55, 45, 32, 22, 48, 39, 50, 34, 99, 36, 12, 60, 42, 51, 23, 64,
       47, 29, 19,  9, 70, 41, 52, 33, 46, 31, 59,  8, 10, 44, 43, 65, 57,
       69, 58, 27, 66, 14, 84,  5, 11, 13, 49, 62, 63, 80, 56],
      dtype=int64)
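As an aside (not part of the original assignment), pd.to_numeric() can fold the blank-handling and conversion into one step with errors='coerce', which turns anything non-numeric, including blank strings, into NaN. Note the semantics differ: blanks would become NaN rather than the 0 used above, so this only fits if "never smoked" should be treated as missing. A minimal sketch:

# hypothetical one-step alternative; blanks become NaN instead of 0
# sub1['S3AQ52'] = pd.to_numeric(sub1['S3AQ52'], errors='coerce')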
In [26]:
# Converting '99' (people who did not answer this question in the survey) to NaN
sub1.loc[sub1['S3AQ52']==99, 'S3AQ52'] = np.nan
In [27]:
sub1['S3AQ52'].unique()
Out[27]:
array([ 0., 21., 16., 20., 30., 40., 17., 25., 15., 35., 38., 37., 26.,
       53., 24., 54., 18., 28., 55., 45., 32., 22., 48., 39., 50., 34.,
       nan, 36., 12., 60., 42., 51., 23., 64., 47., 29., 19.,  9., 70.,
       41., 52., 33., 46., 31., 59.,  8., 10., 44., 43., 65., 57., 69.,
       58., 27., 66., 14., 84.,  5., 11., 13., 49., 62., 63., 80., 56.])
In [28]:
sub1.head()
Out[28]:
   CONSUMER  SMOKER S1Q2C2  S3AQ52 S2AQ19  CONSUMER_NEWL  SMOKER_NEWL
0         3       3            0.0                      0            0
1         1       3            0.0     21               2            0
2         3       3            0.0                      0            0
3         2       3            0.0     16               1            0
4         2       3            0.0     18               1            0
Now, Column S3AQ52 is managed and prepared.
Managing variable - S2AQ19 (AGE AT START OF PERIOD OF HEAVIEST DRINKING)
In [29]:
sub1['S2AQ19'].unique()
Out[29]:
array([' ', '21', '16', '18', '30', '17', '28', '43', '26', '23', '20',
       '51', '19', '40', '35', '27', '42', '22', '15', '36', '25', '24',
       '68', '99', '29', '52', '31', '33', '57', '38', '39', '32', '90',
       '49', '50', '37', '34', '59', '63', '58', '55', '53', '79', '56',
       '77', '41', '64', '8', '73', '6', '70', '13', '72', '44', '47',
       '54', '14', '46', '48', '61', '65', '10', '76', '69', '5', '45',
       '71', '60', '67', '12', '62', '74', '86', '66', '81', '82', '9',
       '75', '83', '80', '78', '7', '87', '11', '85', '84', '91', '88'],
      dtype=object)
In [30]:
# People who are lifetime abstainers
sub1[sub1['S2AQ19']==" "]
Out[30]:
       CONSUMER  SMOKER S1Q2C2  S3AQ52 S2AQ19  CONSUMER_NEWL  SMOKER_NEWL
0             3       3            0.0                      0            0
2             3       3            0.0                      0            0
22            3       1            0.0                      0            2
23            3       3            0.0                      0            0
26            3       3            0.0                      0            0
...         ...     ...    ...     ...    ...            ...          ...
43070         3       1            0.0                      0            2
43071         3       3            0.0                      0            0
43072         3       3            0.0                      0            0
43082         3       3            0.0                      0            0
43088         3       3            0.0                      0            0

8266 rows × 7 columns
In [31]:
# Converting blank values (people who are lifetime abstainers) to 0
sub1.loc[sub1['S2AQ19']==" ", 'S2AQ19'] = 0
In [32]:
# Converting string values of the data frame to numeric
sub1['S2AQ19'] = pd.to_numeric(sub1['S2AQ19'])
In [33]:
sub1['S3AQ52'].unique()
Out[33]:
array([ 0., 21., 16., 20., 30., 40., 17., 25., 15., 35., 38., 37., 26.,
       53., 24., 54., 18., 28., 55., 45., 32., 22., 48., 39., 50., 34.,
       nan, 36., 12., 60., 42., 51., 23., 64., 47., 29., 19.,  9., 70.,
       41., 52., 33., 46., 31., 59.,  8., 10., 44., 43., 65., 57., 69.,
       58., 27., 66., 14., 84.,  5., 11., 13., 49., 62., 63., 80., 56.])
In [34]:
sub1.head()
Out[34]:
   CONSUMER  SMOKER S1Q2C2  S3AQ52  S2AQ19  CONSUMER_NEWL  SMOKER_NEWL
0         3       3            0.0       0              0            0
1         1       3            0.0      21              2            0
2         3       3            0.0       0              0            0
3         2       3            0.0      16              1            0
4         2       3            0.0      18              1            0
Now, Column S2AQ19 is also managed and prepared.
Managing variable - S1Q2C2 (RAISED BY RELATIVES BEFORE AGE 18)
In [35]:
sub1['S1Q2C2'].unique()
Out[35]:
array([' ', '1', '2', '9'], dtype=object)
RAISED BY ADOPTIVE PARENTS BEFORE AGE 18:
1. Yes
2. No
9. Unknown
BL. NA, lived with biological parent(s) before age 18
We can change this to only 3 categories, since we only have to deal with people who were raised by relatives:
1 -> 1. Yes
BL (NA) & 2 -> 0. No (includes those who lived with their biological parent(s) before age 18)
9 -> NaN. Those who didn't answer this question in the survey.
In [36]:
sub1['S1Q2C2'].value_counts(dropna=False)
Out[36]:
     41679
1      649
2      553
9      212
Name: S1Q2C2, dtype: int64
In [47]:
# Converting blank values (people who were raised by their parent(s)) to 0
sub1.loc[sub1['S1Q2C2']==" ", 'S1Q2C2'] = 0
In [38]:
sub1['S1Q2C2'].value_counts(dropna=False)
Out[38]:
0    41679
1      649
2      553
9      212
Name: S1Q2C2, dtype: int64
In [39]:
# Converting value 2 to 0
sub1.loc[sub1['S1Q2C2']=="2", 'S1Q2C2'] = 0
In [40]:
# Converting value 9 to NaN
sub1.loc[sub1['S1Q2C2']=="9", 'S1Q2C2'] = np.nan
In [43]:
sub1['S1Q2C2'].unique()
Out[43]:
array([0, 1, nan], dtype=object)
In [45]:
sub1['S1Q2C2'].value_counts(dropna=False)
Out[45]:
0.0    42232
1.0      649
NaN      212
Name: S1Q2C2, dtype: int64
In [46]:
sub1.head()
Out[46]:
   CONSUMER  SMOKER S1Q2C2  S3AQ52  S2AQ19  CONSUMER_NEWL  SMOKER_NEWL
0         3       3      0     0.0       0              0            0
1         1       3      0     0.0      21              2            0
2         3       3      0     0.0       0              0            0
3         2       3      0     0.0      16              1            0
4         2       3      0     0.0      18              1            0
Now, Column S1Q2C2 is also managed and prepared.
And we can now remove cols CONSUMER and SMOKER for further analysis.
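A minimal sketch of that cleanup step (this cell is not in the original notebook):

# drop the original coded columns, keeping the recoded ones
sub1 = sub1.drop(columns=['CONSUMER', 'SMOKER'])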
0 notes
aidloindia-blog · 7 years ago
Text
CBSE Class 12th Information Practices Syllabus
Optional for the academic year 2019-20 and mandatory for the academic year 2020-21 onwards
CBSE Distribution of Marks
[Image: CBSE marks distribution table]
4.1. Unit 1: Data Handling (DH-2) (80 Theory + 70 Practical)
4.1.1. Python Pandas
- Advanced operations on Data Frames: pivoting, sorting, and aggregation
- Descriptive statistics: min, max, mode, mean, count, sum, median, quartile, var
- Create a histogram, and quantiles
- Function application: pipe, apply, aggregation (group by), transform, and apply map
- Reindexing, and altering labels
4.1.2. Numpy
- 1D array, 2D array
- Arrays: slices, joins, and subsets
- Arithmetic operations on 2D arrays
- Covariance, correlation and linear regression
4.1.3. Plotting with Pyplot
- Plot bar graphs, histograms, frequency polygons, box plots, and scatter plots
4.2. Unit 2: Basic Software Engineering (BSE) (25 Theory + 10 Practical)
- Introduction to software engineering
- Software processes: waterfall model, evolutionary model, and component based model
- Delivery models: incremental delivery, spiral delivery
- Process activities: specification, design/implementation, validation, evolution
- Agile methods: pair programming, and Scrum
- Business use-case diagrams
- Practical aspects: version control system (GIT), and case studies of software systems with use-case diagrams
4.3. Unit 3: Data Management (DM-2) (20 Theory + 20 Practical)
- Write a minimal Django based web application that parses a GET and POST request, and writes the fields to a file - flat file and CSV file
- Interface Python with an SQL database
- SQL commands: aggregation functions, having, group by, order by
4.4. Unit 4: Society, Law and Ethics (SLE-2) (15 Theory)
- Intellectual property rights, plagiarism, digital rights management, and licensing (Creative Commons, GPL and Apache), open source, open data, privacy
- Privacy laws, fraud; cybercrime: phishing, illegal downloads, child pornography, scams; cyber forensics, IT Act, 2000
- Technology and society: understanding of societal issues and cultural changes induced by technology
- E-waste management: proper disposal of used electronic gadgets
- Identity theft, unique IDs, and biometrics
- Gender and disability issues while teaching and using computers
- Role of new media in society: online campaigns, crowdsourcing, smart mobs
- Issues with the internet: internet as an echo chamber, net neutrality, internet addiction
- Case studies: Arab Spring, WikiLeaks, Bitcoin
CBSE Practical Syllabus
[Image: CBSE practical evaluation scheme]
5.1. Data Management: SQL + web server
- Find the min, max, sum, and average of the marks in a student marks table.
- Find the total number of customers from each country in the table (customer ID, customer Name, country) using group by.
- Write a SQL query to order the (student ID, marks) table in descending order of the marks.
- Integrate SQL with Python by importing the MySQLdb module.
- Write a Django based web server to parse a user request (POST), and write it to a CSV file.
5.2. Data handling using Python libraries
- Use map functions to convert all negative numbers in a Data Frame to the mean of all the numbers.
- Consider a Data Frame where each row contains the item category, item name, and expenditure. Group the rows by the category, and print the total expenditure per category.
- Given a Series, print all the elements that are above the 75th percentile.
- Given a day's worth of stock market data, aggregate it. Print the highest, lowest, and closing prices of each stock.
- Given sample data, plot a linear regression line.
- Take data from government web sites, aggregate and summarize it. Then plot it using different plotting functions of the PyPlot library.
5.3. Basic Software Engineering
- Business use-case diagrams for an airline ticket booking system, train reservation system, stock exchange
- Collaboratively write a program and manage the code with a version control system (GIT)
CBSE Project Syllabus
The aim of the class project is to create something that is tangible and useful. This should be done in groups of 2 to 3 students, and should be started by students at least 6 months before the submission deadline. The aim here is to find a real-world problem that is worthwhile to solve.

Students are encouraged to visit local businesses and ask them about the problems that they are facing. For example, if a business is finding it hard to create invoices for filing GST claims, then students can do a project that takes the raw data (a list of transactions), groups the transactions by category, accounts for the GST tax rates, and creates invoices in the appropriate format.

Students can be extremely creative here. They can use a wide variety of Python libraries to create user-friendly applications such as games, software for their school, software for their disabled fellow students, and mobile applications. Of course, to do some of these projects, some additional learning is required; this should be encouraged. Students should know how to teach themselves. If three people work on a project for 6 months, at least 500 lines of code is expected.

The committee has also been made aware of the degree of plagiarism in such projects. Teachers should take a very strict look at this situation, and take strict disciplinary action against students who cheat on lab assignments or projects, or who use pirated software to do the same. Everything that is proposed can be achieved using absolutely free and legitimate open source software.
0 notes
shilkaren · 5 years ago
Text
Reindexing in Python Pandas
Reindexing is used to change the row labels and column labels of a DataFrame.
It means to conform the data to match a given set of labels along a particular axis.
It helps us to perform multiple operations through indexing, such as:
- Inserting missing value (NaN) markers in label locations where no data for the label existed before.
- Reordering the existing data to match a new set of labels.
Example
import pandas as pd
import numpy as np
N=20
data = pd.DataFrame({
  'A': pd.date_range(start='2016-01-01',periods=N,freq='D'),
  'x': np.linspace(0,stop=N-1,num=N),
  'y': np.random.rand(N),
  'C': np.random.choice(['Low','Medium','High'],N).tolist(),
  'D': np.random.normal(100, 10, size=(N)).tolist()
})
#reindexing the DataFrame
data_reindexed = data.reindex(index=[0,2,5], columns=['A', 'C', 'B'])
print(data_reindexed)
Output:
          A     C   B
0 2016-01-01  High NaN
2 2016-01-03   Low NaN
5 2016-01-06  High NaN
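Reindexing can also fill the holes it introduces with a constant instead of NaN, via the optional fill_value argument. A minimal sketch reusing the data frame above (the variable name data_filled is just for illustration):

data_filled = data.reindex(index=[0,2,5], columns=['A', 'C', 'B'], fill_value=0)

print(data_filled)

Column B would then contain 0 in every row instead of NaN.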
How to Reindex to Align with Other Objects?
Let us consider a case where you want to take an object and reindex its axes so they are labeled the same as another object.
Take the following example to get a better understanding.
Example:
import pandas as pd
import numpy as np
data1 = pd.DataFrame(np.random.randn(10,3),columns=['column1','column2','column3'])
data2 = pd.DataFrame(np.random.randn(7,3),columns=['column1','column2','column3'])
data1 = data1.reindex_like(data2)
print(data1)
Output
   column1   column2   column3
0 0.271240  0.201199 -0.151743
1 -0.269379  0.262300 0.019942
2 0.685737 -0.233194 -0.652832
3 -1.416394 -0.587026  1.065789
4 -0.590154 -2.194137  0.707365
5 0.393549  1.801881 -2.529611
6 0.062660 -0.996452 -0.029740
Note − Here, the data1 DataFrame is altered and reindexed like data2. If a column name does not match, the entire column is filled with NaN.
How to Fill Values While Reindexing?
We can also fill in missing values while we are reindexing the dataset.
The pandas reindex() method takes an optional method parameter which controls how the values are filled. The options are as follows:
- pad/ffill – fills values in the forward direction.
- bfill/backfill – fills values in the backward direction.
- nearest – fills values from the nearest index values.
Example
import pandas as pd
import numpy as np
df1 = pd.DataFrame(np.random.randn(6,3),columns=['col1','col2','col3'])
df2 = pd.DataFrame(np.random.randn(2,3),columns=['col1','col2','col3'])
# Padding NAN's
print(df2.reindex_like(df1))
# Now Fill the NAN's with preceding Values
print ("Data Frame with Forward Fill:")
print (df2.reindex_like(df1,method='ffill'))
Output
      col1      col2      col3
0 -1.046918  0.608691 1.081329
1 -0.396384 -0.176895 -1.896393
2       NaN       NaN       NaN
3       NaN       NaN       NaN
4       NaN       NaN       NaN
5       NaN       NaN       NaN
Data Frame with Forward Fill:
      col1      col2      col3
0 -1.046918  0.608691 1.081329
1 -0.396384 -0.176895 -1.896393
2 -0.396384 -0.176895 -1.896393
3 -0.396384 -0.176895 -1.896393
4 -0.396384 -0.176895 -1.896393
5 -0.396384 -0.176895 -1.896393
Note – In the above example the last four rows are padded.
How to Limit Filling While Reindexing?
The reindex() function also takes a "limit" parameter, which sets the maximum number of consecutive rows to fill.
Let's understand this with an example:
Example
import pandas as pd
import numpy as np
df1 = pd.DataFrame(np.random.randn(6,3),columns=['col1','col2','col3'])
df2 = pd.DataFrame(np.random.randn(2,3),columns=['col1','col2','col3'])
# Padding NAN's
print(df2.reindex_like(df1))
# Now Fill the NAN's with preceding Values
print ("Data Frame with Forward Fill limiting to 1:")
print(df2.reindex_like(df1,method='ffill',limit=1))
 Output
      col1      col2     col3
0 0.824697  0.122557 -0.156242
1 0.528174 -1.140847 -1.158778
2       NaN       NaN       NaN
3       NaN       NaN       NaN
4       NaN       NaN       NaN
5       NaN       NaN       NaN
Data Frame with Forward Fill limiting to 1:
      col1      col2      col3
0 0.824697  0.122557 -0.156242
1 0.528174 -1.140847 -1.158778
2 0.528174 -1.140847 -1.158778
3       NaN       NaN       NaN
4       NaN       NaN       NaN
5       NaN       NaN       NaN
Note – In the above example, only row 2 is filled from the preceding row 1 (because limit=1); rows 3 to 5 are left as NaN.
How to Rename in Pandas?
Pandas provides a rename() method which allows us to relabel an axis based on a mapping (a dict or a Series) or an arbitrary function.
Let's take an example to understand this.
Example
import pandas as pd
import numpy as np
data1 = pd.DataFrame(np.random.randn(6,3),columns=['col1','col2','col3'])
print(data1)
print ("After renaming the rows and columns:")
print(data1.rename(columns={'col1' : 'c1', 'col2' : 'c2'},
index = {0 : 'apple', 1 : 'banana', 2 : 'mango'}))
Output
      col1      col2      col3
0 0.047170  0.378306 -1.198150
1 1.183208 -2.195630 -0.798192
2 0.256581  0.627994 -0.674260
3 0.240853  1.677340  1.497613
4 0.820688  0.920151 -1.431485
5 -0.010474 -0.228373 -0.392640
After renaming the rows and columns:
              c1        c2      col3
apple   0.047170  0.378306 -1.198150
banana  1.183208 -2.195630 -0.798192
mango   0.256581  0.627994 -0.674260
3       0.240853  1.677340  1.497613
4       0.820688  0.920151 -1.431485
5      -0.010474 -0.228373 -0.392640
This rename() method provides an inplace named parameter, which by default is False and copies the underlying data. Pass inplace=True to rename the data in place.
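Because rename() also accepts an arbitrary function, passing a callable applies it to every label along that axis. A minimal sketch with the same data1:

print(data1.rename(columns=str.upper))

This would print the frame with columns COL1, COL2 and COL3; the underlying data is unchanged.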
0 notes